03/29/2021

Getting Started

Setup

Welcome! While we’re waiting:

Introduction

  • About me

  • About you

    • Your familiarity with US Census data
    • with geospatial data
    • with geospatial data in R

Outline

  • Describe primary Census data products

  • Introduce R packages for working with Census Data

  • Use those packages to fetch census data

  • Use those packages to fetch census data plus census geographic boundary files

  • Make maps of census data

Census Data Overview

US Census Data

The “nation’s leading provider of quality data about its people and economy.”

Available at www.census.gov

Primary Census Products

  • Decennial Census

  • American Community Survey (ACS)

Decennial Census

Complete count of the population every 10 years since 1790

Includes data on

  • population, by age & race/ethnicity

  • housing, by occupancy & tenure (owned, rented)

From 1940 to 2000, additional questions were asked of a sample of the population.

Since 2005, those sample questions have been asked in the American Community Survey (ACS).

American Community Survey (ACS)

  • Annual survey of a sample of about 3.5 million households

  • Provides estimates of demographic, social, economic & housing characteristics

  • Includes margin of error values for the estimates.
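ACS margins of error are published at the 90 percent confidence level, so an estimate and its margin of error can be turned into an interval with simple arithmetic. A minimal sketch in base R; the estimate and margin of error below are made-up values for illustration, not real ACS figures.

```r
# Hypothetical ACS estimate and its published margin of error (MOE)
estimate <- 75000   # e.g., a median household income estimate (made up)
moe      <- 6000    # MOE at the 90% confidence level (made up)

# 90% confidence interval: estimate plus or minus the MOE
ci_90 <- c(lower = estimate - moe, upper = estimate + moe)

# Standard error implied by a 90% MOE (MOE = 1.645 * SE)
se <- moe / 1.645

# Coefficient of variation (%): a rough gauge of reliability
cv <- 100 * se / estimate

print(ci_90)
print(round(cv, 1))
```

The smaller the coefficient of variation, the more reliable the estimate.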

Decennial Census* vs ACS Data

Demographic*      Social               Economic            Housing
Sex               Families             Income              Tenure*
Age               Education            Benefits            Occupancy*
Race              Marital Status       Employment Status   Structure Type
Hispanic Origin   Fertility            Occupation          Housing Value
                  Grandparents         Industry            Taxes & Insurance
                  Veterans             Commuting           Utilities
                  Disability Status    Place of Work       Mortgage
                  Language at Home     Health Insurance    Monthly Rent
                  Citizenship
                  Mobility

Census Geographies

Census microdata (data collected from individuals) are aggregated and made publicly available at one or more levels of geography. Not all data tables are available for all geographies; e.g., only decennial census data are available at the block level.

Census Data & Census Geographies

ACS Data Products

ACS 1 year and 5 year products are currently available through 2019

  • New data are released at the end of the following year (e.g., 2020 data in 2021)

ACS 3 year no longer available (2008 - 2013)

ACS 5 year data provide more reliable estimates, with lower margins of error

  • More data tables are available for ACS 5 Year product

See: Census ACS: Guidance for Data Users

Census Data Workflow

Identify your

  • Topic of interest
  • Dataset: Decennial census or ACS?
  • Year(s)
  • Tabulation unit of aggregation (county, tract, etc)
  • Geographic filter: for what specific locations?

Then determine what specific tables and variables are available

CAUTION

“If you want to measure change you can’t change the measures!”

Census tables, variables, geographies, and geographic boundaries change over time!

Measuring change over time with census data is its own thing, complex and not covered by this workshop!

Accessing Census Data

Census APIs

You can write code to fetch data from the Census Web APIs

  • API: application programming interface

  • Web API: URLs can be formatted to make queries that return data

Or you can leverage an existing R package to make this easier!

  • That’s what we will do!

Only a subset of recent Census data products are available via APIs
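To make the idea of a Web API concrete, here is a sketch of how a query URL for the 2010 decennial census could be assembled by hand. The endpoint pattern is assumed from the Census Bureau's API conventions and should be checked against the API documentation; the key is a placeholder, and nothing is actually sent over the network.

```r
# Pieces of a Census API query (2010 decennial census, Summary File 1)
base_url  <- "https://api.census.gov/data/2010/dec/sf1"
variables <- c("NAME", "P001001")   # state name and total population
geography <- "state:*"              # all states
api_key   <- "YOUR_API_KEY"         # placeholder, not a real key

# Assemble the query string: get=<variables>&for=<geography>&key=<key>
query_url <- paste0(
  base_url,
  "?get=", paste(variables, collapse = ","),
  "&for=", geography,
  "&key=", api_key
)

print(query_url)
```

tidycensus builds and sends requests of this kind for you, then tidies the response into a dataframe.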

R Packages for Working with Census Data

tidycensus & tigris

tidycensus

An R package with functions that make it easier to fetch decennial census and ACS data from the Census APIs.

  • Limited to the data available from the Census APIs:

    • decennial census: 1990, 2000, and 2010
    • ACS 1 yr: 2005 through 2019
    • ACS 5 yr: 2005-2009 through 2015-2019 are available.
      • Note: tidycensus refers to ACS 5 year datasets by the end year, e.g., 2009 or 2019.
      • latest available year is the default year for tidycensus functions
  • actively maintained and expanding to include more census data products (see tidycensus website)
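Since tidycensus names each ACS 5 year dataset by its end year, it can help to translate an end year back into the span it covers. A tiny sketch; acs5_span is a hypothetical helper written for this example, not a tidycensus function.

```r
# Hypothetical helper: the 5-year span covered by an ACS 5-year end year
acs5_span <- function(end_year) {
  paste0(end_year - 4, "-", end_year)
}

print(acs5_span(2019))
print(acs5_span(c(2009, 2019)))
```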

Requesting a Census API key

tigris

Provides access to Census geographic data files

  • detailed TIGER/Line boundary files or
  • simplified cartographic boundary files (the default in tidycensus)

Also provides access to additional geographic data,

  • e.g., rivers, roads, coastlines, landmarks, and more

Used by tidycensus to access state, county, tract, block group, block, and ZCTA boundaries.

tidycensus & tigris

tidycensus tutorials

tidyverse

A collection of R Packages for data science, developed primarily by Hadley Wickham, Chief Scientist at RStudio, including:

  • dplyr and tidyr for reshaping data

  • ggplot2 for plotting

  • purrr, readr and tibble for improved performance

These packages and more are used by tidycensus under the hood.

sf package

Simple features for geospatial data objects and methods.

  • Next generation R package for working with vector geospatial data
    • supersedes the sp package

sf includes the functionality of the sp, rgdal, rgeos and proj4 packages.

  • but with improved performance, simplified command syntax, and easier workflows.

sf is loaded and used automatically by tidycensus

mapview

mapview provides functions for quickly and easily creating interactive maps.

Tutorial Time!

Part 1

We will work through several exercises using tidycensus to fetch, wrangle and map census data.

Install packages

Install any packages we will use that are not installed already. If you installed any of these a while ago, it’s a good idea to install updates!

# A list of the packages we will use
list_of_packages <- c("tidyverse","tidycensus","tigris","sf","mapview")

# identify the ones we need to install
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]

# install any that are not installed (new_packages)
if(length(new_packages) > 0) {
  print(paste("Installing these packages:", paste(new_packages, collapse = ", ")))
  install.packages(new_packages)
} else {
  print("All packages already installed!")
}

Loading packages

Load the packages we will use today

library(tidycensus)
library(tidyverse) 
library(tigris)
library(sf)
library(mapview)

If you get errors when loading these packages, try loading dplyr on its own with library(dplyr) or reinstalling the dplyr package; that has worked for some people.

Census API Key

Install your Census API Key

Use the tidycensus function census_api_key to register your API key with tidycensus

# Install your census api key - long alphanumeric string
census_api_key("THE_BIG_LONG_ALPHANUMERIC_API_KEY_YOU_GOT_FROM_CENSUS")

Install your Census API Key

I keep my key in a file so no one can see it

# source (run) an r script that creates a variable with my key
source("/Users/pattyf/Documents/Dlab/workshops/keys/census_api_key.R")

#register the key
census_api_key(my_census_api_key)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
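When census_api_key is run with install = TRUE, tidycensus stores the key in your .Renviron file under the environment variable CENSUS_API_KEY. A small sketch, using only base R, for checking whether a key is available in the current session:

```r
# Look up the environment variable that tidycensus uses for the API key
key <- Sys.getenv("CENSUS_API_KEY")

# nzchar() is TRUE when the string is non-empty
if (nzchar(key)) {
  message("Census API key found (", nchar(key), " characters).")
} else {
  message("No Census API key found; register one with census_api_key().")
}
```

Checking the length rather than printing the key keeps it out of your console history and screen shares.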

Set working directory

Fetching Decennial Census Data

Population Data

Let’s start by fetching population data from the 2010 Census for all states

In order to fetch census data you need to identify the census variables that contain the data of interest.

Topics, Tables & Variables

Census data variables are organized in tables

Which are organized by topic or concept.

The tidycensus load_variables function can help with this step.

First, take a look at the function documentation.

?load_variables

load_variables

Use load_variables to fetch all variables used in the 2010 census into a dataframe.

vars2010 <- load_variables(year=2010,        # Year or end year for ACS-5yr
                           dataset = 'sf1',  # 'sf1' for decennial or 'acs5', etc
                           cache = TRUE)     # Whether to save fetched data locally

Decennial Census Variables

Let’s take a look at and discuss the resultant dataframe.

View(vars2010)

2010 Decennial Census Tables

  • Topics: Population, housing

  • Variables: 3,346

  • 333 Tables - that’s a lot!

    • 177 population tables (identified with a “P”) available to the block level
    • 58 housing tables (identified with an “H”) available to the block level
    • 82 population tables (identified with a “PCT”) available to the census tract level
    • 4 housing tables (identified with an “HCT”) available to the census tract level
    • 10 population tables (identified with a “PCO”) available to the county level
    • plus 2 additional PCT tables

https://www.census.gov/data/datasets/2010/dec/summary-file-1.html

What Variable has the 2010 Total Population value?

We can sort and filter the vars2010 dataframe to find it.
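A sketch of that search, using a tiny stand-in for vars2010 so it runs without an API call. The column names (name, label, concept) match what load_variables returns; the three rows are an illustrative excerpt, and their label text should be treated as approximate.

```r
# Stand-in for a few rows of vars2010 (illustrative excerpt)
vars_demo <- data.frame(
  name    = c("P001001", "P002002", "H004004"),
  label   = c("Total", "Total!!Urban", "Total!!Renter occupied"),
  concept = c("TOTAL POPULATION", "URBAN AND RURAL", "TENURE"),
  stringsAsFactors = FALSE
)

# Keep rows whose concept mentions "total population" (case-insensitive)
hits <- vars_demo[grepl("total population", vars_demo$concept,
                        ignore.case = TRUE), ]

# Sort the matches by variable name
hits <- hits[order(hits$name), ]
print(hits)
```

The same pattern works on the full vars2010 dataframe.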

get_decennial

We can use the tidycensus function get_decennial to fetch the 2010 census data for total population by state.

First, check the documentation for the function.

?get_decennial

get_decennial

Fetch total population by state (P001001) from the 2010 census using get_decennial.

pop2010 <- get_decennial(geography = "state",   # census tabulation unit
                         variables = "P001001", # variable(s) of interest
                         year = 2010)           # census year
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

View the Data

  • How many rows and columns?

  • Do you see the expected number of states?

  • What column contains the population counts?

  • Do the data values seem to be right?

head(pop2010)
tail(pop2010)

head(pop2010)
## # A tibble: 6 x 4
##   GEOID NAME       variable    value
##   <chr> <chr>      <chr>       <dbl>
## 1 01    Alabama    P001001   4779736
## 2 02    Alaska     P001001    710231
## 3 04    Arizona    P001001   6392017
## 4 05    Arkansas   P001001   2915918
## 5 06    California P001001  37253956
## 6 22    Louisiana  P001001   4533372
tail(pop2010)
## # A tibble: 6 x 4
##   GEOID NAME          variable   value
##   <chr> <chr>         <chr>      <dbl>
## 1 51    Virginia      P001001  8001024
## 2 53    Washington    P001001  6724540
## 3 54    West Virginia P001001  1852994
## 4 55    Wisconsin     P001001  5686986
## 5 56    Wyoming       P001001   563626
## 6 72    Puerto Rico   P001001  3725789

Visualize results

We can visualize the data to get a quick overview of the distribution of data values.

It’s a first step in exploratory data analysis and a last step in data communication.

ggplot2 is the most commonly used R package for data visualization.

  • It is loaded when you load the tidyverse package.

Let’s use it to visualize the population data.

Plot 2010 Population by state

Use ggplot2 to create an ordered horizontal bar chart.

pop_plot<- ggplot(data=pop2010, aes(x=reorder(NAME,value), y=value/1000000)) + 
  geom_bar(stat="identity") + coord_flip() +
  theme_minimal() + 
  labs(title = "2010 US Population by State") +
  xlab("State") +
  ylab("in millions")

Display the plot

pop_plot

Challenge

Fetch total population data by state from the 2000 decennial census.

Don’t assume variable names are the same across years.

Check first by loading the 2000 variables into a dataframe.

Challenge Solution

Total Population in 2000

# What is the variable name in 2000?
vars2000 <- load_variables(year=2000, dataset = 'sf1', cache = T)

# Take a look and search in the dataframe
View(vars2000)

# Fetch the 2000 pop data
pop2000 <- get_decennial(geography = "state", variables = "P001001", year = 2000)

# Take a look  
View(pop2000)

Limiting by Area of Interest

In the previous example we retrieved population data for all states.

  • This is the default behavior if you don’t specify a subset.

  • But you can limit the data to be retrieved by subunits like state.

Limit Areas of Interest

Let’s fetch data for just 3 states.

state_pop2010 <- get_decennial(geography = "state", # census tabulation unit
                         variables = "P001001",     # variables of interest
                         year = 2010,               # census year
                         state=c("CA","OR","WA"))   # Filter by states of interest
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Note we are referencing states by their abbreviation.

View Results

state_pop2010
## # A tibble: 3 x 4
##   GEOID NAME       variable    value
##   <chr> <chr>      <chr>       <dbl>
## 1 06    California P001001  37253956
## 2 41    Oregon     P001001   3831074
## 3 53    Washington P001001   6724540

Changing Census Tabulation unit

get_decennial accepts a number of different values for tabulation unit.

  • Options include: state, county, tract, block group, block, and ZCTA.

Let’s change the tabulation unit from state to county.

county_pop2010 <- get_decennial(geography = "county", # census tabulation unit
                            variables = "P001001",    # variable(s) of interest
                            year = 2010)              # data year - only one!
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Changing Census Tabulation unit

View the county data to see what was retrieved.

View(county_pop2010)

Challenge

  • Fetch population by county for Oregon & California

Try it before you look ahead at solutions.

Challenge Solution

## Fetch population by **county** for Oregon & California
county_pop2010_ca_and_or <- get_decennial(geography = "county",   # census tabulation unit
                                 variables = "P001001",  # variables of interest
                                 year = 2010,
                                 state=c('CA','OR'))
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
#head(county_pop2010_ca_and_or)

Census tract data

Census tracts are the most commonly used census tabulation unit.

Let’s change the census tabulation unit to tract and fetch population data.

Census Tract Data

Fetch total population for all states by census tract

## Fetch population by **tract** for all states.
pop2010_tracts <- get_decennial(geography = "tract",    # census tabulation unit
                                variables = "P001001",  # variables of interest
                                year = 2010)

Census Tract Data

Fetch total population for California by census tract

## Fetch population by **tract** for California.
cal_pop2010_tracts <- get_decennial(geography = "tract",       # census tabulation unit
                                       variables = "P001001",  # variables of interest
                                       year = 2010,
                                       state=c('CA'))      # State filter

Fetching Census Tract Data

If you want census data at the tract level or below, you must specify the state(s).

  • You can also specify one or more counties

tract_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year - only one!
                         state="CA",                  # limit to California
                         county=c("Alameda","Contra Costa"))  # & these counties
## Getting data from the 2010 decennial Census

Fetching Census Tract Data

View the results! How many census tracts are in these 2 counties?

dim(tract_pop2010)
View(tract_pop2010)

Using FIPS codes

You can use names, abbreviations or FIPS codes for your state and county.

# County FIPS codes for the nine Bay Area counties:
# Alameda, San Francisco, Contra Costa, Marin, Napa,
# San Mateo, Santa Clara, Solano, Sonoma
nine_counties <- c("001", "075", "013", "041", "055", "081", "085", "095", "097")

# Fetch population by **tract** for the nine county Bay Area
bayarea_pop2010_tract <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year
                         state="CA",                  # limit to state of California
                         county=nine_counties)  # and only these counties
# View results
# View(bayarea_pop2010_tract)
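FIPS codes also explain the GEOID values in the output: a tract GEOID is the 2-digit state code, the 3-digit county code, and the 6-digit tract code joined together. A short base R sketch, using the GEOID of Census Tract 4001 in Alameda County:

```r
# A census tract GEOID: state (2 digits) + county (3) + tract (6)
geoid <- "06001400100"

state_fips  <- substr(geoid, 1, 2)    # California
county_fips <- substr(geoid, 3, 5)    # Alameda County
tract_code  <- substr(geoid, 6, 11)   # Census Tract 4001

print(c(state = state_fips, county = county_fips, tract = tract_code))
```

Keep GEOIDs as character strings so that leading zeros are not lost.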

Any QUESTIONS?

Fetching data for more than one census variable

What three things are new here?

#urban and rural pop for 3 CA counties
ur_pop10 <- get_decennial(geography = "county",  # census tabulation unit
                           variables = c(urban="P002002",rural="P002005"),
                           year = 2010, 
                           summary_var = "P002001",  # The denominator
                           state='CA',
                           county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Fetching data for more than one census variable

  1. You can specify more than one variable:

variables = c("P002002","P002005")

  2. You can rename the values in the output ‘variable’ column:

variables = c(urban="P002002",rural="P002005")

  3. You can identify a summary_var, a denominator such as the total count of all people or households, which can be used for calculations like percent of total:

summary_var = "P002001"

Take a look at the results

ur_pop10
## # A tibble: 6 x 5
##   GEOID NAME                         variable  value summary_value
##   <chr> <chr>                        <chr>     <dbl>         <dbl>
## 1 06045 Mendocino County, California urban     48110         87841
## 2 06055 Napa County, California      urban    118194        136484
## 3 06097 Sonoma County, California    urban    424102        483878
## 4 06045 Mendocino County, California rural     39731         87841
## 5 06055 Napa County, California      rural     18290        136484
## 6 06097 Sonoma County, California    rural     59776        483878

Calculating Percents

The summary_value column comes in handy when you want to compute percent of total, for example:

# Calculate the percent of population that is Urban or Rural
ur_pop10 <- ur_pop10 %>%
            mutate(pct = 100 * (value / summary_value))

Calculating Percents

Let’s take a look at the output

ur_pop10 # Take a look
## # A tibble: 6 x 6
##   GEOID NAME                         variable  value summary_value   pct
##   <chr> <chr>                        <chr>     <dbl>         <dbl> <dbl>
## 1 06045 Mendocino County, California urban     48110         87841  54.8
## 2 06055 Napa County, California      urban    118194        136484  86.6
## 3 06097 Sonoma County, California    urban    424102        483878  87.6
## 4 06045 Mendocino County, California rural     39731         87841  45.2
## 5 06055 Napa County, California      rural     18290        136484  13.4
## 6 06097 Sonoma County, California    rural     59776        483878  12.4
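One quick sanity check on a calculation like this: within each county, the urban and rural percentages should add up to 100. A base R sketch, with the values typed in from the output above so that it runs on its own (county names are shortened here):

```r
# Values copied from the ur_pop10 output
ur_demo <- data.frame(
  NAME = rep(c("Mendocino", "Napa", "Sonoma"), times = 2),
  variable = rep(c("urban", "rural"), each = 3),
  value = c(48110, 118194, 424102, 39731, 18290, 59776),
  summary_value = rep(c(87841, 136484, 483878), times = 2)
)
ur_demo$pct <- 100 * ur_demo$value / ur_demo$summary_value

# Sum the percentages within each county
pct_totals <- tapply(ur_demo$pct, ur_demo$NAME, sum)
print(pct_totals)
```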

Plot it

Plots give us compact visual summaries of the data

myplot <- ggplot(data = ur_pop10, 
          mapping = aes(x = NAME, fill = variable, 
                     y = ifelse(test = variable == "urban", 
                                yes = -pct, no = pct))) +
          geom_bar(stat = "identity") +
          scale_y_continuous(labels = abs, limits=c(-100,100)) +
          labs(title="Urban & Rural Population in Wine Country", 
               x="County", y = " Percent of Population", fill="") +
          coord_flip()

Don’t worry if you don’t get all the ggplot code now. It’s here for reference.

Plot it

myplot

Fetch all the data in one table

This is often helpful but you need to keep track of the meaning of each variable.

  • You can go back to the vars2010 dataframe and filter by the table id to check

alco_pop10 <- get_decennial(geography = "tract", # Census tabulation unit
                           table =  "P002",      # Table of urban & rural population counts
                           year = 2010,          # Decennial census year
                           state='CA',           # Filter state
                           county="Alameda")     # Filter county
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Take a look

unique(alco_pop10$variable) # What and how many unique vars in table?
## [1] "P002001" "P002002" "P002003" "P002004" "P002005" "P002006"
head(alco_pop10,3)  # Take a look at output
## # A tibble: 3 x 4
##   GEOID       NAME                                          variable value
##   <chr>       <chr>                                         <chr>    <dbl>
## 1 06001400100 Census Tract 4001, Alameda County, California P002001   2937
## 2 06001400200 Census Tract 4002, Alameda County, California P002001   1974
## 3 06001400300 Census Tract 4003, Alameda County, California P002001   4865

Output options

Let’s try all three of these commands and then look at the output to see what’s different.

get_decennial(geography = "state", variables = "P001001",
              year = 2010)

get_decennial(geography = "state", variables = c(pop10="P001001"),
              year = 2010)

get_decennial(geography = "state", variables = c(pop10="P001001"),
              year = 2010, output="wide")
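The difference between the default long (tidy) output and output="wide" is easiest to see on a small example. This sketch reshapes a long table by hand with pivot_wider from tidyr, which approximates what output="wide" gives you; the rows are copied from the urban and rural results shown earlier.

```r
library(tidyr)

# Long format: one row per county per variable
long_demo <- data.frame(
  GEOID = c("06045", "06055", "06045", "06055"),
  NAME = c("Mendocino County, California", "Napa County, California",
           "Mendocino County, California", "Napa County, California"),
  variable = c("urban", "urban", "rural", "rural"),
  value = c(48110, 118194, 39731, 18290)
)

# Wide format: one row per county, one column per variable
wide_demo <- pivot_wider(long_demo, names_from = variable, values_from = value)
print(wide_demo)
```

Long output is convenient for ggplot2; wide output is convenient for calculations across variables and for mapping.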

Data Wrangling

Your R skills can help you reformat the data and make it more useable.

Let’s fetch population data for 2010 & 2000 by state with output="wide".

  • We will label the variables pop00 and pop10.

Then we will combine these into one data frame.

Data Wrangling

Fetch pop by state from both the 2000 and 2010 census

pop2000 <- get_decennial(geography = "state",
                         variables = c(pop00="P001001"), 
                         year = 2000, output="wide")
## Getting data from the 2000 decennial Census
## Using Census Summary File 1
pop2010 <- get_decennial(geography = "state",
                         variables = c(pop10="P001001"), 
                         year = 2010, output="wide")
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Take a look at the output

What column(s) can we use to merge these two dataframes?

head(pop2000, 3)
## # A tibble: 3 x 3
##   GEOID NAME      pop00
##   <chr> <chr>     <dbl>
## 1 01    Alabama 4447100
## 2 02    Alaska   626932
## 3 04    Arizona 5130632
head(pop2010, 3)
## # A tibble: 3 x 3
##   GEOID NAME      pop10
##   <chr> <chr>     <dbl>
## 1 01    Alabama 4779736
## 2 02    Alaska   710231
## 3 04    Arizona 6392017

Merge population by state from both censuses

Save in a new dataframe with both columns

pop2000_2010 <- pop2000 %>% merge(pop2010, by="NAME") %>%
                             select(NAME, pop00, pop10)

head(pop2000_2010,3)
##      NAME   pop00   pop10
## 1 Alabama 4447100 4779736
## 2  Alaska  626932  710231
## 3 Arizona 5130632 6392017
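Merging on GEOID as well as NAME is a slightly more defensive variant, since codes are less likely than names to differ between years. The sketch below uses the three rows shown above as stand-in data and also adds a percent change column; pct_change is just an illustrative column name.

```r
# Stand-ins for the first rows of pop2000 and pop2010
pop00_demo <- data.frame(GEOID = c("01", "02", "04"),
                         NAME  = c("Alabama", "Alaska", "Arizona"),
                         pop00 = c(4447100, 626932, 5130632))
pop10_demo <- data.frame(GEOID = c("01", "02", "04"),
                         NAME  = c("Alabama", "Alaska", "Arizona"),
                         pop10 = c(4779736, 710231, 6392017))

# Join on both key columns so no GEOID.x / GEOID.y duplicates appear
pop_demo <- merge(pop00_demo, pop10_demo, by = c("GEOID", "NAME"))

# Percent change in population from 2000 to 2010
pop_demo$pct_change <- 100 * (pop_demo$pop10 - pop_demo$pop00) / pop_demo$pop00
print(pop_demo)
```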

Save the data

Use write.csv to save a data frame to a CSV file.

write.csv(pop2000_2010, file="data_out/pop2000_2010.csv", row.names = FALSE)
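write.csv fails if the output folder does not exist, so it can help to create it first. A sketch, assuming the folder name data_out used above; here the folder is created under a temporary directory so that the example can be run anywhere, and the data frame is a two-row stand-in.

```r
# Create the output folder if needed (under tempdir() for this demo)
out_dir <- file.path(tempdir(), "data_out")
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Small stand-in data frame to write
demo <- data.frame(NAME = c("Alabama", "Alaska"),
                   pop00 = c(4447100, 626932),
                   pop10 = c(4779736, 710231))

out_file <- file.path(out_dir, "pop2000_2010.csv")
write.csv(demo, file = out_file, row.names = FALSE)

# Read it back to confirm the file was written as expected
check <- read.csv(out_file)
print(check)
```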

Any QUESTIONS?

Part 2. Mapping

Mapping Census Data with tidycensus

You can fetch geographic data by adding the parameter geometry=TRUE to tidycensus functions

  • Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.

  • Only a subset of data available via tigris can be accessed via tidycensus.

You can then use your favorite mapping functions or libraries like plot, ggplot and tmap to make maps.

Geometry Options

Before fetching census geographic data, we need to set the option tigris_use_cache to TRUE

Caching greatly speeds things up if you fetch the same census geographic data repeatedly.

# Tigris options - used by tidycensus
# Cache retrieved geographic data locally
options(tigris_use_cache = TRUE)  

Fetch geographic boundary data with tidycensus

We fetch the geospatial data by setting geometry=TRUE.

pop2010geo <- get_decennial(geography = "state", 
                          variables = c(pop10="P001001"), 
                          year = 2010, 
                          output="wide", 
                          geometry=TRUE) # Fetch geometry with the data for mapping
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Take a look

Let’s take a minute to discuss the format of an sf spatial object.

head(pop2010geo, 3)
## Simple feature collection with 3 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -90.41814 ymin: 41.23796 xmax: -66.9499 ymax: 48.19097
## geographic CRS: NAD83
## # A tibble: 3 x 4
##   GEOID NAME        pop10                                               geometry
##   <chr> <chr>       <dbl>                                     <MULTIPOLYGON [°]>
## 1 23    Maine      1.33e6 (((-67.61976 44.51975, -67.61541 44.52197, -67.58774 …
## 2 25    Massachus… 6.55e6 (((-70.83204 41.6065, -70.82373 41.59857, -70.82092 4…
## 3 26    Michigan   9.88e6 (((-88.68443 48.11578, -88.67563 48.12044, -88.67639 …

Geospatial Data in R

R sf objects include

  • a dataframe with a geometry column, usually named geometry

    • The geometry can be of type POINT, LINESTRING, POLYGON
    • or, MULTIPOINT, MULTILINESTRING or MULTIPOLYGON
  • a CRS (coordinate reference system), specified by

    • EPSG (SRID) code
    • proj4string

For a deeper understanding of the sf package and its functionality, we recommend our Geospatial-Fundamentals-in-R-with-sf workshop.
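To see these pieces in isolation, here is a minimal sketch that builds a two-row sf object by hand. The place names and coordinates are approximate values chosen for illustration; EPSG code 4269 identifies NAD83.

```r
library(sf)

# Two points in longitude/latitude (approximate, for illustration only)
pts <- st_sfc(st_point(c(-122.27, 37.87)),
              st_point(c(-122.42, 37.77)),
              crs = 4269)   # EPSG:4269 is NAD83

# An sf object = a data frame plus a geometry column
places <- st_sf(data.frame(name = c("Berkeley", "San Francisco")),
                geometry = pts)

print(st_geometry_type(places))   # geometry type of each feature
print(st_crs(places)$epsg)        # EPSG code of the CRS
print(names(places))              # attribute columns plus "geometry"
```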

Census Data Coordinate Reference System (CRS)

All census geographic data use the NAD83 CRS, or coordinate reference system. NAD83 stands for North American Datum of 1983. The geographic coordinates are longitude and latitude values encoded as decimal degrees.

WGS84, the World Geodetic System of 1984, is the most commonly used geographic CRS. Coordinates for the same location can differ between the two systems by up to about 1 meter in the continental US.

Many geospatial operations require that you transform data to a common CRS before conducting spatial analysis or mapping.

An in-depth discussion of CRSs is outside the scope of this workshop. See Geocomputation with R for more information.
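As a small illustration of transforming between the two systems, the sketch below moves a single NAD83 point to WGS84 with st_transform from the sf package. The coordinates are approximate and chosen only for illustration.

```r
library(sf)

# One point in NAD83 (EPSG:4269), coordinates approximate
pt_nad83 <- st_sfc(st_point(c(-122.27, 37.87)), crs = 4269)

# Transform to WGS84 (EPSG:4326)
pt_wgs84 <- st_transform(pt_nad83, crs = 4326)

print(st_crs(pt_nad83)$epsg)
print(st_crs(pt_wgs84)$epsg)
print(st_coordinates(pt_wgs84))
```

The same call works on a whole sf object, such as the census data fetched with geometry=TRUE.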

Mapping sf Spatial Objects

We can use plot to make a quick map of the geometry stored in an sf spatial object.

plot(pop2010geo$geometry)

Question

What do you get if you plot the sf object without specifying “$geometry”?

Try it!

plot(pop2010geo)

The Challenge of US maps

The vast geographic extent and non-contiguous nature of the USA makes it difficult to map.

Fetch geographic data with tidycensus, SHIFTED

tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.

pop2010geo_shifted <- get_decennial(geography = "state", 
                                    variables = c(pop10="P001001"), 
                                    output="wide",
                                    year = 2010, 
                                    geometry=TRUE, 
                                    shift_geo=TRUE)
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Using Census Summary File 1
## Please note: Alaska and Hawaii are being shifted and are not to scale.

Shift Happens!

plot(pop2010geo_shifted$geometry)

Save it

You can save any sf data object to a shapefile using st_write

st_write(pop2010geo_shifted, "data_out/usa_pop2010_shifted.shp")
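Two things are worth knowing when writing shapefiles: st_write stops with an error if the file already exists unless told to overwrite it, and the shapefile format limits field names to 10 characters. A sketch with a one-row stand-in object written to a temporary folder; the column name population_2010 is deliberately too long, and the values are made up.

```r
library(sf)

# Tiny stand-in sf object (coordinates approximate, values made up)
demo_sf <- st_sf(
  data.frame(name = "demo point", population_2010 = 100),
  geometry = st_sfc(st_point(c(-122.27, 37.87)), crs = 4269)
)

out_file <- file.path(tempdir(), "demo_points.shp")

# delete_layer = TRUE overwrites an existing layer instead of raising an error
st_write(demo_sf, out_file, delete_layer = TRUE, quiet = TRUE)

# Read the shapefile back in and compare column names
demo_back <- st_read(out_file, quiet = TRUE)
print(names(demo_sf))
print(names(demo_back))   # long field names come back abbreviated
```

Expect a warning about abbreviated field names; short column names like pop10 avoid it.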

Check it out

# Check to see if the data was written out to a shapefile
dir("data_out") 

Mapping Data Values

Use the sf plot command to make a map that color codes the geometry by the column values

plot(pop2010geo_shifted['pop10'])  # a choropleth map!

ggplot2 Map

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf()  # tells ggplot that geographic data are being plotted

Challenge

Create a map of CA Population in 2010 by county

Challenge Solution

2010 pop Data for California Counties

#fetch it
cal_pop10 <- get_decennial(geography = "county", 
                           variables = "P001001",
                           year = 2010, 
                           state='CA',
                           geometry=TRUE)

# map it
plot(cal_pop10['value'])

Fetch County data for more than one state

We can fetch the census data and the geometry for more than one state with the same function call

  • This is so much easier than any alternative approach!
  • It can be applied to other geographic tabulation areas (tracts, places) and area filters (state, county)

west_pop10 <- get_decennial(geography = "county", 
                           variables =  "P001001",
                           year = 2010, 
                           state=c('CA', 'NV'),
                           geometry=T)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Map it

These are just quick plots to make sure we got the right data!

plot(west_pop10['value'])

Challenge

Fetch and map the 2010 population by census tract for Alameda and Contra Costa counties.

Challenge Solution

Fetch Tract population & geometry data for Alameda & Contra Costa Counties

alcc_pop10 <- get_decennial(geography = "tract", 
                      variables = "P001001", 
                      year = 2010, 
                      state='CA',
                      county=c("Alameda","Contra Costa"),
                      geometry=T) 
## Getting data from the 2010 decennial Census

Challenge Solution

Map it

plot(alcc_pop10['value'])

More Complex Query

Let’s use the 2010 census data to map the percent of San Francisco properties that were rented

To start, identify the variables for the

  • total number of housing units

  • number of renter occupied units

Complete the query

sf_rented <- get_decennial(geography =  ,  # census tabulation unit
                           variables =   , # number of households rented
                           year =  , 
                           summary_var = ,  # Total households
                           state=,
                           county=,
                           geometry=)

SF Percent Rented Units, 2010

sf_rented <- get_decennial(geography = "tract",  # census tabulation unit
                           variables =  "H004004", #number of households rented
                           year = 2010, 
                           summary_var = "H004001",  # Total households
                           state='CA',
                           county='San Francisco',
                           geometry=T)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Take a look at the output

How to get the percent of units that were rented?

head(sf_rented)
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79121 xmax: -122.3996 ymax: 37.81144
## geographic CRS: NAD83
## # A tibble: 6 x 6
##   GEOID  NAME       variable value summary_value                        geometry
##   <chr>  <chr>      <chr>    <dbl>         <dbl>              <MULTIPOLYGON [°]>
## 1 06075… Census Tr… H004004   1707          2090 (((-122.4206 37.81111, -122.40…
## 2 06075… Census Tr… H004004   1830          2544 (((-122.425 37.811, -122.4242 …
## 3 06075… Census Tr… H004004   1492          2026 (((-122.4149 37.80354, -122.41…
## 4 06075… Census Tr… H004004   1741          2479 (((-122.4129 37.80218, -122.41…
## 5 06075… Census Tr… H004004   1792          2338 (((-122.4117 37.79629, -122.41…
## 6 06075… Census Tr… H004004   1418          1858 (((-122.4092 37.79204, -122.41…

Percent rented

sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
                 mutate(pct = 100 * (value / summary_value))

# Take a look
head(sf_pct_rented)
## Simple feature collection with 6 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.4267 ymin: 37.79121 xmax: -122.3996 ymax: 37.81144
## geographic CRS: NAD83
## # A tibble: 6 x 7
##   GEOID  NAME    variable value summary_value                     geometry   pct
##   <chr>  <chr>   <chr>    <dbl>         <dbl>           <MULTIPOLYGON [°]> <dbl>
## 1 06075… Census… H004004   1707          2090 (((-122.4206 37.81111, -122…  81.7
## 2 06075… Census… H004004   1830          2544 (((-122.425 37.811, -122.42…  71.9
## 3 06075… Census… H004004   1492          2026 (((-122.4149 37.80354, -122…  73.6
## 4 06075… Census… H004004   1741          2479 (((-122.4129 37.80218, -122…  70.2
## 5 06075… Census… H004004   1792          2338 (((-122.4117 37.79629, -122…  76.6
## 6 06075… Census… H004004   1418          1858 (((-122.4092 37.79204, -122…  76.3

Map the result

plot(sf_pct_rented['pct'])

Questions?

Part 3. ACS 5 year data

ACS Data with tidycensus

We can use tidycensus to fetch ACS data just like we fetched the decennial census data.

We will use the function get_acs instead of get_decennial

BUT it’s a more complex workflow because

  1. There are a lot more ACS tables and variables.

  2. The ACS contains sample data, so each ACS variable that you retrieve with tidycensus comes with both an estimate of the value and a margin of error.

Fetch List of ACS 5 year Variables

Use the load_variables function to get a dataframe of all variables from the ACS 2015-2019 5 year dataset

Then View the dataset and filter for variables related to median household income

acs2019vars <- load_variables(year=2019, dataset = 'acs5', cache = T)

# Review and filter the dataframe of ACS variables
#View(acs2019vars)
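As an alternative to scrolling through View, the filtering step can be sketched in code. This is a minimal sketch, assuming the variables data frame has the name, label, and concept columns that load_variables returns; find_vars is a hypothetical helper written for this slide, not a tidycensus function.

```r
# Hypothetical helper (not part of tidycensus): keep the rows whose
# concept or label matches a search pattern, ignoring case.
find_vars <- function(vars, pattern) {
  hits <- grepl(pattern, vars$concept, ignore.case = TRUE) |
          grepl(pattern, vars$label, ignore.case = TRUE)
  vars[hits, ]
}

# Usage with the data frame fetched above (uncomment to run):
# find_vars(acs2019vars, "median household income")
```

The name column of the matching rows holds the variable IDs to pass to get_acs.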

Fetch Data on Median Household Income

Let’s fetch the median household income data for Alameda County

alco_mhhincome <- get_acs(geography='tract',
                        variables=c(median_hhincome = "B19013_001"),
                        year = 2019,
                        state='CA',
                        county='Alameda',
                        geometry=T
                        )
## Getting data from the 2015-2019 5-year ACS

Take a look

head(alco_mhhincome)
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.2887 ymin: 37.52248 xmax: -121.8779 ymax: 37.81562
## geographic CRS: NAD83
##         GEOID                                             NAME        variable
## 1 06001442301 Census Tract 4423.01, Alameda County, California median_hhincome
## 2 06001437400    Census Tract 4374, Alameda County, California median_hhincome
## 3 06001437701 Census Tract 4377.01, Alameda County, California median_hhincome
## 4 06001402400    Census Tract 4024, Alameda County, California median_hhincome
## 5 06001402500    Census Tract 4025, Alameda County, California median_hhincome
## 6 06001450743 Census Tract 4507.43, Alameda County, California median_hhincome
##   estimate   moe                       geometry
## 1   110761 21966 MULTIPOLYGON (((-121.9701 3...
## 2    86210  9325 MULTIPOLYGON (((-122.0926 3...
## 3    64559  6732 MULTIPOLYGON (((-122.0747 3...
## 4    39913  8581 MULTIPOLYGON (((-122.284 37...
## 5    30000 12436 MULTIPOLYGON (((-122.2879 3...
## 6   128737  9289 MULTIPOLYGON (((-121.9066 3...
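The estimate and moe columns can be read together as a range. The sketch below copies the first two rows of the output above into a plain data frame (geometry omitted) and computes the bounds, relying on the fact that ACS margins of error are published at the 90 percent confidence level. The lower and upper column names are made up for illustration.

```r
# First two rows of the head() output above, geometry column omitted
acs <- data.frame(
  GEOID    = c("06001442301", "06001437400"),
  estimate = c(110761, 86210),
  moe      = c(21966, 9325)
)

# 90 percent confidence bounds: estimate minus/plus the margin of error
acs$lower <- acs$estimate - acs$moe
acs$upper <- acs$estimate + acs$moe

acs
```

A wide range relative to the estimate is a signal to treat that tract's value with caution.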

Map it

plot(alco_mhhincome['???'])

Map it

plot(alco_mhhincome['estimate'])

Fetching multiple variables

First define the set of variables of interest.

# Median Household income by Race - variables from ACS 2015-2019
inc_by_race <- c(All =   "B19013_001",
                 White = "B19013H_001",
                 Black = "B19013B_001",
                 Asian = "B19013D_001",
                 Hispanic = "B19013I_001" )

Fetch the data

Fetch census tract data for multiple variables at once

alco_mhhincome_by_race <- get_acs(geography='tract',
                        variables=inc_by_race,
                        year = 2019,
                        state='CA',
                        county='Alameda',
                        geometry=T )
## Getting data from the 2015-2019 5-year ACS

Facet Map

Facet maps make it easy to create visualizations of small multiples, or subsets of the data that facilitate comparisons. Here, we use ggplot to make multiple maps of income by race for Alameda County.

medhhinc_facet_map <- alco_mhhincome_by_race %>%
                        ggplot(aes(fill = estimate)) +
                          facet_wrap(~variable) +
                          geom_sf(color=NA) +
                          scale_fill_viridis_c()

Facet Map Output

medhhinc_facet_map

Wide Output

…because sometimes you don’t want tidy format

alco_mhhincome_by_race2 <- get_acs(geography='tract',
                                  variables=inc_by_race,
                                  year = 2019,
                                  state='CA',
                                  county='Alameda',
                                  geometry=T,
                                  output="wide")
## Getting data from the 2015-2019 5-year ACS

Wide output

head(alco_mhhincome_by_race2)
## Simple feature collection with 6 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.2887 ymin: 37.52248 xmax: -121.8779 ymax: 37.81562
## geographic CRS: NAD83
##         GEOID                                             NAME   AllE  AllM
## 1 06001442301 Census Tract 4423.01, Alameda County, California 110761 21966
## 2 06001437400    Census Tract 4374, Alameda County, California  86210  9325
## 3 06001437701 Census Tract 4377.01, Alameda County, California  64559  6732
## 4 06001402400    Census Tract 4024, Alameda County, California  39913  8581
## 5 06001402500    Census Tract 4025, Alameda County, California  30000 12436
## 6 06001450743 Census Tract 4507.43, Alameda County, California 128737  9289
##   WhiteE WhiteM BlackE BlackM AsianE AsianM HispanicE HispanicM
## 1  87686  27850     NA     NA 132071  19754    104336     26940
## 2  83417  11963 107656  37219 122692  17395     75645      9909
## 3  74000  35763  58000  29756  62262  22638     64375      8638
## 4 137938  69610  31989  19980  11818   4976        NA        NA
## 5     NA     NA  20556   5948  53523  51287        NA        NA
## 6 109671  26749     NA     NA 131350  15507    136250     42576
##                         geometry
## 1 MULTIPOLYGON (((-121.9701 3...
## 2 MULTIPOLYGON (((-122.0926 3...
## 3 MULTIPOLYGON (((-122.0747 3...
## 4 MULTIPOLYGON (((-122.284 37...
## 5 MULTIPOLYGON (((-122.2879 3...
## 6 MULTIPOLYGON (((-121.9066 3...
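One reason to prefer wide output: with one row per tract, comparing two variables is plain column arithmetic. A minimal sketch, using rows 2 and 3 of the output above copied into a plain data frame (geometry and MOE columns omitted); the white_minus_black column name is made up for illustration.

```r
# Rows 2 and 3 of the wide head() output above (estimates only)
wide <- data.frame(
  GEOID  = c("06001437400", "06001437701"),
  WhiteE = c(83417, 74000),
  BlackE = c(107656, 58000)
)

# Difference in median household income within each tract
wide$white_minus_black <- wide$WhiteE - wide$BlackE

wide
```

The same line would work on alco_mhhincome_by_race2 itself; tracts with NA in either column would simply get NA for the difference.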

Challenge

Make a map of MEDIAN GROSS RENT in Alameda and Contra Costa Counties by tract using data from the ACS 2015-2019 5 year product

alcc_medrent <- get_acs(geography= ,
                              variables= ,
                              year = ,
                              state= ,
                              county= ,
                              geometry=)

Challenge Solution

alcc_medrent <- get_acs(geography="tract",
                              variables=c(median_rent2019="B25064_001"),
                              year =2019,
                              state="CA",
                              county=c("Alameda","Contra Costa"),
                              geometry=T)
## Getting data from the 2015-2019 5-year ACS
# Uncomment to view map
#plot(alcc_medrent['estimate'])

Interactive Mapping

Interactive Mapping

Interactive mapping gives the RStudio environment some of the functionality of desktop GIS.

There are a number of R packages that you can use, including:

  • mapview: quick interactive exploratory data viewing

  • tmap: great static and interactive maps

  • leaflet: highly customizable interactive maps

All of these are based on the Leaflet JavaScript library.

Mapview

Let’s use mapview to make some quick interactive maps of our median household income data.

mapview(alco_mhhincome_by_race2)

Interactive Choropleth map

mapview(alco_mhhincome_by_race2, zcol="AllE")

Challenge

Use mapview to create a map of median gross rent (alcc_medrent)

Challenge Solution

mapview(alcc_medrent, zcol='estimate')

Any Questions?

Figuring out the ACS Variables to use

ACS variables can be confusing.

Some ways to identify the best variables to explore:

Web search, especially Census web resources

The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.

The NHGIS website (nhgis.org) is a great way to browse variables of interest

Margins of Error (MOE)

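We haven’t covered margins of error in detail, but they may be important in your work with ACS data.

When you combine variables (for example, by summing estimates or computing a proportion), the MOEs must be combined too, using specific formulas.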

  • tidycensus includes some nice functions for these calculations.

See this web page on how to handle MOEs in tidycensus
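To make the math concrete, here is a minimal base R sketch of two standard Census Bureau approximation formulas. The function names moe_of_sum and moe_of_prop and the input numbers are made up for illustration; tidycensus provides moe_sum() and moe_prop() for the same purpose, and those are the ones to use in practice.

```r
# MOE of a sum of estimates: square root of the sum of squared MOEs
moe_of_sum <- function(moe) {
  sqrt(sum(moe^2))
}

# MOE of a proportion p = num / denom, where num is a subset of denom
moe_of_prop <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  under_root <- moe_num^2 - p^2 * moe_denom^2
  # If the term under the root is negative, fall back to the ratio
  # formula, which adds the two terms instead of subtracting
  under_root <- ifelse(under_root < 0,
                       moe_num^2 + p^2 * moe_denom^2,
                       under_root)
  sqrt(under_root) / denom
}

# Hypothetical inputs, for illustration only
moe_of_sum(c(30, 40))
moe_of_prop(num = 50, denom = 100, moe_num = 10, moe_denom = 10)
```

Note that the MOE of a sum is smaller than the simple total of the individual MOEs.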

Summary

Summary

tidycensus offers two key functions for fetching census tabular and geographic data: get_acs and get_decennial

  • Support for fetching Public Use Microdata Sample (PUMS) data and Population Estimate data was recently added

Using tidycensus to fetch the tabular data, or both tabular and geographic data, is IMO way easier than any alternative, IF you (1) know R and (2) know a bit about working with geographic data in R.

  • This approach is also scalable if you want multiple census variables for various locations.

Summary

You can greatly enhance your maps if you make them with ggplot2 rather than the default plot command.

Interactive mapping greatly enhances your ability to do exploratory data analysis in RStudio.

References